
    Low-Bandwidth and Non-Compute Intensive Remote Identification of Microbes from Raw Sequencing Reads

    Cheap high-throughput DNA sequencing may soon become routine not only for human genomes but also for practically anything requiring the identification of living organisms from their DNA: tracking of infectious agents, control of food products, bioreactors, or environmental samples. We propose a novel general approach to the analysis of sequencing data in which the reference genome does not have to be specified. Using a distributed architecture, we query a remote server for hints about what the reference might be while transferring only a relatively small amount of data; the hints can then be used for more computationally demanding work. Our system consists of a server with known reference DNA indexed and a client with raw sequencing reads. The client sends a sample of unidentified reads and in return receives a list of matching references known to the server. Sequences for those references can then be retrieved and used for exhaustive computation on the reads, such as alignment. To demonstrate this approach we have implemented a web server indexing tens of thousands of publicly available genomes and genomic regions from various organisms, which returns lists of matching hits for query sequencing reads. We have also implemented two clients, one of them running in a web browser, to demonstrate that gigabytes of raw sequencing reads of unknown origin can be identified on modestly powered computing devices without transferring a very large volume of data. Web access is available at http://tapir.cbs.dtu.dk. The source code for a Python command-line client, a server, and supplementary data is available at http://bit.ly/1aURxkc
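
    As a rough illustration of the client side of this protocol, the sketch below samples a handful of reads from a FASTQ file and posts them to a hypothetical identification endpoint. The endpoint URL, request format, and JSON response are assumptions for illustration, not the actual TAPIR API.

    # Minimal client sketch, assuming a hypothetical HTTP endpoint that accepts
    # FASTA-formatted reads and returns a JSON list of candidate references.
    import json
    import random
    import urllib.request

    def sample_reads(fastq_path, n=100):
        """Reservoir-sample n read sequences from a FASTQ file."""
        sample = []
        with open(fastq_path) as fh:
            for i, line in enumerate(fh):
                if i % 4 != 1:          # FASTQ: only every 4th line is a sequence
                    continue
                k = i // 4              # 0-based index of this read
                if len(sample) < n:
                    sample.append(line.strip())
                elif random.randrange(k + 1) < n:
                    sample[random.randrange(n)] = line.strip()
        return sample

    def query_server(reads, url="http://example.org/identify"):   # hypothetical endpoint
        """POST sampled reads as FASTA and return the server's candidate references."""
        fasta = "".join(">r%d\n%s\n" % (i, seq) for i, seq in enumerate(reads))
        req = urllib.request.Request(url, data=fasta.encode(),
                                     headers={"Content-Type": "text/plain"})
        with urllib.request.urlopen(req) as resp:
            return json.loads(resp.read())      # assumed: a ranked list of reference hits

    if __name__ == "__main__":
        print(query_server(sample_reads("reads.fastq")))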

    Comparative analysis and visualization of multiple collinear genomes

    Background: Genome browsers are a common tool used by biologists to visualize genomic features including genes, polymorphisms, and many others. However, existing genome browsers and visualization tools are not well-suited to perform meaningful comparative analysis among a large number of genomes. With the increasing quantity and availability of genomic data, there is an increased burden to provide useful visualization and analysis tools for comparison of multiple collinear genomes, such as the large panels of model organisms which are the basis for much of current genetic research. Results: We have developed a novel web-based tool for visualizing and analyzing multiple collinear genomes. Our tool illustrates genome-sequence similarity through a mosaic of intervals representing local phylogeny, subspecific origin, and haplotype identity. Comparative analysis is facilitated through reordering and clustering of tracks, which can vary throughout the genome. In addition, we provide local phylogenetic trees as an alternate visualization to assess local variations. Conclusions: Unlike previous genome browsers and viewers, ours allows for simultaneous and comparative analysis. Our browser provides intuitive selection and interactive navigation of features of interest. Dynamic visualizations adjust to scale and data content, making analysis at variable resolutions and of multiple data sets more informative. We demonstrate our genome browser on an extensive collection of genomic data sets from almost 200 distinct mouse laboratory strains.
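
    The reordering and clustering of tracks can be pictured as grouping strain tracks by how similar their interval labels (for example, haplotype assignments) are within the window currently on screen. The sketch below is an illustrative simplification using a greedy ordering over a simple mismatch distance; the data layout, strain names, and labels are assumptions, not the tool's internals.

    def mismatch_distance(a, b):
        """Fraction of intervals in the current window where two tracks differ."""
        return sum(x != y for x, y in zip(a, b)) / len(a)

    def cluster_order(tracks):
        """Greedy single-linkage ordering: start with one track, then repeatedly
        append the unplaced track closest to any track already placed."""
        remaining = list(tracks)
        order = [remaining.pop(0)]
        while remaining:
            nxt = min(remaining, key=lambda name: min(
                mismatch_distance(tracks[name], tracks[placed]) for placed in order))
            order.append(nxt)
            remaining.remove(nxt)
        return order

    # Each track: one haplotype label per interval in the visible window (toy data).
    window = {
        "C57BL/6J": ["H1", "H1", "H2", "H2"],
        "DBA/2J":   ["H1", "H3", "H2", "H2"],
        "CAST/EiJ": ["H4", "H4", "H4", "H5"],
    }
    print(cluster_order(window))   # tracks with similar mosaics end up adjacent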

    Enrichment of homologs in insignificant BLAST hits by co-complex network alignment

    Background: Homology is a crucial concept in comparative genomics. The algorithm most widely used for homology detection in comparative genomics is probably BLAST. Usually a stringent score cutoff is applied to distinguish putative homologs from possible false positive hits. As a consequence, some BLAST hits are discarded that are in fact homologous. Results: Analogous to the use of genomic context in genome alignments, we test whether conserved functional context can be used to select candidate homologs from insignificant BLAST hits. We make a co-complex network alignment between complex subunits in yeast and human and find that proteins with an insignificant BLAST hit that are part of homologous complexes are likely to be homologous themselves. Further analysis of the distant homologs we recovered using the co-complex network alignment shows that a large majority of these distant homologs are in fact ancient paralogs. Conclusions: Our results show that, even though evolution takes place at the sequence and genome level, co-complex networks can be used as circumstantial evidence to improve confidence in the homology of distantly related sequences.
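
    In spirit, the filter amounts to: keep a sub-threshold BLAST hit between a yeast and a human protein only if the two proteins sit in complexes that the co-complex network alignment has paired with each other. The sketch below is a simplified rendering of that idea with assumed data structures and placeholder identifiers, not the authors' pipeline.

    def rescue_hits(weak_hits, aligned_complexes, yeast_membership, human_membership):
        """Keep sub-threshold hits whose proteins sit in complexes the network
        alignment has paired with each other."""
        aligned = set(aligned_complexes)
        rescued = []
        for yeast_prot, human_prot, score in weak_hits:
            y_cplx = yeast_membership.get(yeast_prot, set())
            h_cplx = human_membership.get(human_prot, set())
            if any((yc, hc) in aligned for yc in y_cplx for hc in h_cplx):
                rescued.append((yeast_prot, human_prot, score))
        return rescued

    # All names below are placeholders for illustration only.
    weak_hits = [("yeastA", "humanA", 32.0), ("yeastB", "humanB", 28.5)]
    aligned_complexes = [("yeast_complex_1", "human_complex_1")]
    yeast_membership = {"yeastA": {"yeast_complex_1"}}
    human_membership = {"humanA": {"human_complex_1"}}
    print(rescue_hits(weak_hits, aligned_complexes, yeast_membership, human_membership))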

    Random-phase approximation and its applications in computational chemistry and materials science

    The random-phase approximation (RPA) as an approach for computing the electronic correlation energy is reviewed. After a brief account of its basic concept and historical development, the paper is devoted to the theoretical formulations of RPA and its applications to realistic systems. With several illustrative applications, we discuss the implications of RPA for computational chemistry and materials science. The computational cost of RPA, which is critical for its widespread use in future applications, is also addressed. In addition, current correction schemes going beyond RPA and directions for further development are discussed. Comment: 25 pages, 11 figures, published online in J. Mater. Sci. (2012)
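
    For orientation, the central quantity reviewed here is usually written within the adiabatic-connection fluctuation-dissipation framework as an imaginary-frequency integral over the non-interacting response function and the Coulomb kernel; the LaTeX below gives this standard textbook expression, quoted only as context for the review.

    % Standard adiabatic-connection fluctuation-dissipation form of the RPA correlation energy
    E_c^{\mathrm{RPA}} = \frac{1}{2\pi} \int_0^{\infty} \mathrm{d}\omega \,
      \mathrm{Tr}\!\left[ \ln\!\bigl( 1 - \chi^0(\mathrm{i}\omega)\, v \bigr) + \chi^0(\mathrm{i}\omega)\, v \right]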

    Role of Duplicate Genes in Robustness against Deleterious Human Mutations

    It is now widely recognized that robustness is an inherent property of biological systems [1],[2],[3]. The contribution of close sequence homologs to genetic robustness against null mutations has been previously demonstrated in simple organisms [4],[5]. In this paper we investigate in detail the contribution of gene duplicates to providing back-up against deleterious human mutations. Our analysis demonstrates that functional compensation by close homologs may play an important role in human genetic disease. Genes with a 90% sequence identity homolog are about 3 times less likely to harbor known disease mutations than genes with only remote homologs. Moreover, close duplicates affect the phenotypic consequences of deleterious mutations by making a decrease in life expectancy significantly less likely. We also demonstrate that similarity of expression profiles across tissues significantly increases the likelihood of functional compensation by homologs.
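
    The headline comparison can be reproduced in outline by stratifying genes by the sequence identity of their closest paralog and contrasting the fraction of known disease genes in each stratum. The sketch below shows only that bookkeeping, with toy inputs and illustrative cutoffs, not the paper's data or statistical analysis.

    def disease_fraction(genes, disease_genes):
        """Fraction of the given genes that carry known disease mutations."""
        return sum(g in disease_genes for g in genes) / len(genes) if genes else float("nan")

    def compare_strata(closest_identity, disease_genes, close_cut=0.90, remote_cut=0.30):
        """Contrast genes with a close duplicate against genes with only remote homologs.
        The cutoffs here are illustrative, not the paper's."""
        close = [g for g, ident in closest_identity.items() if ident >= close_cut]
        remote = [g for g, ident in closest_identity.items() if ident < remote_cut]
        f_close = disease_fraction(close, disease_genes)
        f_remote = disease_fraction(remote, disease_genes)
        return f_close, f_remote, (f_remote / f_close if f_close else float("inf"))

    # Toy inputs: gene -> sequence identity of its closest paralog.
    toy_identity = {"G1": 0.95, "G2": 0.92, "G3": 0.25, "G4": 0.10,
                    "G5": 0.91, "G6": 0.20, "G7": 0.94, "G8": 0.15}
    print(compare_strata(toy_identity, disease_genes={"G2", "G3", "G4", "G6"}))
    # -> (0.25, 0.75, 3.0): in this toy set, remote-homolog genes are 3x more likely to be disease genes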

    Multiple organism algorithm for finding ultraconserved elements

    Background: Ultraconserved elements are nucleotide or protein sequences with 100% identity (no mismatches, insertions, or deletions) within the same organism or between two or more organisms. Studies indicate that these conserved regions are associated with microRNAs, mRNA processing, development, and transcriptional regulation. The identification and characterization of these elements among genomes is necessary for further understanding of their functionality. Results: We describe an algorithm and provide freely available software which can find all of the ultraconserved sequences between genomes of multiple organisms. Our algorithm takes a combinatorial approach that finds all sequences without requiring the genomes to be aligned. The algorithm is significantly faster than BLAST and is designed to handle very large genomes efficiently. We ran our algorithm on several large comparative analyses to evaluate its effectiveness; one comparing 17 vertebrate genomes, in which we find 123 ultraconserved elements longer than 40 bp shared by all of the organisms, and another comparing the human body louse, Pediculus humanus humanus, against itself and select insects to find thousands of non-coding, potentially functional sequences. Conclusion: Whole genome comparative analysis for multiple organisms is both feasible and desirable in our search for biological knowledge. We argue that bioinformatic programs should be forward thinking by assuming analysis on multiple (and possibly large) genomes in the design and implementation of algorithms. Our algorithm shows how a compromise design trading off disk space against memory space allows for efficient computation while requiring only modest computer resources, and at the same time provides benefits not available with other software.
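
    A much-simplified version of the underlying task, ignoring the paper's disk/memory trade-offs and combinatorial machinery, is to flag every fixed-length window of one genome that occurs exactly in all the others and then merge adjacent windows into maximal runs. The sketch below does exactly that for small in-memory strings and is meant only to illustrate the problem, not the published algorithm.

    # Toy illustration: find runs of the first sequence whose every window of
    # length L occurs exactly (ungapped, no mismatches) in all other sequences.
    # Real genomes need the disk/memory-aware indexing described in the paper.

    def shared_windows(seqs, L):
        """Start positions in seqs[0] whose length-L window occurs in every other sequence."""
        others = [{s[i:i + L] for i in range(len(s) - L + 1)} for s in seqs[1:]]
        ref = seqs[0]
        return [i for i in range(len(ref) - L + 1)
                if all(ref[i:i + L] in kmers for kmers in others)]

    def merge_runs(starts, L):
        """Merge overlapping/adjacent window starts into maximal (start, end) intervals."""
        intervals = []
        for s in starts:
            if intervals and s <= intervals[-1][1]:
                intervals[-1][1] = max(intervals[-1][1], s + L)
            else:
                intervals.append([s, s + L])
        return [tuple(iv) for iv in intervals]

    seqs = ["AAGGTTCCAAGTC", "TTGGTTCCAAGAA", "CCGGTTCCAAGTT"]
    hits = shared_windows(seqs, L=6)
    print(merge_runs(hits, L=6))   # [(2, 11)] -> the shared block GGTTCCAAG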

    EVEREST: automatic identification and classification of protein domains in all protein sequences

    BACKGROUND: Proteins are composed of one or several building blocks, known as domains. Such domains can be classified into families according to their evolutionary origin. Whereas sequencing technologies have advanced immensely in recent years, there are no matching computational methodologies for large-scale determination of protein domains and their boundaries. We provide and rigorously evaluate a novel set of domain families that is automatically generated from sequence data. Our domain family identification process, called EVEREST (EVolutionary Ensembles of REcurrent SegmenTs), begins by constructing a library of protein segments that emerge in an all vs. all pairwise sequence comparison. It then proceeds to cluster these segments into putative domain families. The selection of the best putative families is done using machine learning techniques. A statistical model is then created for each of the chosen families. This procedure is then iterated: the aforementioned statistical models are used to scan all protein sequences, to recreate a library of segments and to cluster them again. RESULTS: Processing the Swiss-Prot section of the UniProt Knowledgebase, release 7.2, EVEREST defines 20,230 domains, covering 85% of the amino acids of the Swiss-Prot database. EVEREST annotates 11,852 proteins (6% of the database) that are not annotated by Pfam A. In addition, in 43,086 proteins (20% of the database), EVEREST annotates a part of the protein that is not annotated by Pfam A. Performance tests show that EVEREST recovers 56% of Pfam A families and 63% of SCOP families with high accuracy, and suggests previously unknown domain families with at least 51% fidelity. EVEREST domains are often a combination of domains as defined by Pfam or SCOP and are frequently sub-domains of such domains. CONCLUSION: The EVEREST process and its output domain families provide an exhaustive and validated view of the protein domain world that is automatically generated from sequence data. The EVEREST library of domain families, accessible for browsing and download at [1], provides a complementary view to that provided by other existing libraries. Furthermore, since it is automatic, the EVEREST process is scalable and we will run it in the future on larger databases as well. The EVEREST source files are available for download from the EVEREST web site.
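
    At a very high level, the pipeline alternates between building segment families from the current library and re-scanning all sequences with models of the selected families. The loop below is only a structural sketch of that iteration: every step is a deliberately trivial stand-in for the corresponding EVEREST stage (all-vs-all comparison, clustering plus machine-learning selection, statistical models, scanning), not the published method.

    from collections import Counter

    def initial_segments(sequences, k=6):
        """Stand-in for the all-vs-all comparison: recurring length-k substrings."""
        counts = Counter(s[i:i + k] for s in sequences for i in range(len(s) - k + 1))
        return {seg for seg, n in counts.items() if n > 1}

    def select_families(segments, max_families=5):
        """Stand-in for clustering + machine-learning selection: keep a few segments."""
        return sorted(segments)[:max_families]

    def rescan(sequences, families):
        """Stand-in for scanning with statistical models: exact substring matching."""
        return {seg for seg in families for s in sequences if seg in s}

    def everest_like(sequences, iterations=3):
        library = initial_segments(sequences)
        for _ in range(iterations):
            families = select_families(library)
            library = rescan(sequences, families)
        return sorted(library)

    # Toy sequences sharing one recurring segment.
    print(everest_like(["MKVLITGAGAGKT", "AAVLITGAGAGWW", "GGKVLITGAGAGT"]))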

    MSDmotif: exploring protein sites and motifs

    Background: Protein structures have conserved features – motifs – which have a significant influence on protein function. These motifs can be found in sequence as well as in 3D space. Understanding of these fragments is essential for 3D structure prediction, modelling and drug design. The Protein Data Bank (PDB) is the source of this information; however, present search tools have limited 3D options for integrating protein sequence with 3D structure. Results: We describe here a web application for querying the PDB for ligands, binding sites, small 3D structural and sequence motifs, and the underlying database. Novel algorithms for searching chemical fragments, 3D motifs, φ/ψ sequences, super-secondary structure motifs and small 3D structural motif associations are incorporated. The interface provides functionality for visualization, search criteria creation, and sequence and 3D multiple alignment options. MSDmotif is an integrated system where a results page is also a search form. A set of motif statistics is available for analysis. This set includes molecule and motif binding statistics, distribution of motif sequences, occurrence of amino acids within a motif, correlation of amino-acid side-chain charges within a motif, and Ramachandran plots for each residue. The binding statistics are presented in association with properties that include a ligand fragment library. Access is also provided through the Distributed Annotation System (DAS) protocol. An additional entry point facilitates XML requests with XML responses. Conclusion: MSDmotif is unique in combining chemical, sequence and 3D data in a single search engine with a range of search and visualisation options. It provides multiple views of data found in the PDB archive for exploring protein structures.
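
    One of the search modes mentioned above, φ/ψ sequences, can be pictured as translating each residue's backbone dihedral angles into a coarse conformational letter and then matching a pattern over the resulting string. The sketch below uses a crude, assumed binning of the Ramachandran plane and a plain regular expression; it is not the MSDmotif implementation.

    # Illustrative sketch: encode residues by a coarse Ramachandran region derived
    # from (phi, psi) angles, then search the encoded string for a motif pattern.
    # The region boundaries and the motif pattern are assumptions for illustration.
    import re

    def region(phi, psi):
        """Map backbone dihedrals (degrees) to a coarse conformational letter."""
        if -180 <= phi < 0 and -120 <= psi <= 50:
            return "a"   # roughly alpha-helical
        if -180 <= phi < 0 and (psi > 50 or psi < -120):
            return "b"   # roughly beta/extended
        return "l"       # left-handed / other

    def phi_psi_string(dihedrals):
        return "".join(region(phi, psi) for phi, psi in dihedrals)

    # Toy chain: four helical residues, a turn-like residue, two extended residues.
    dihedrals = [(-60, -45), (-63, -42), (-58, -47), (-61, -40), (60, 40), (-120, 130), (-135, 140)]
    encoded = phi_psi_string(dihedrals)
    print(encoded)                                  # "aaaalbb"
    print(bool(re.search(r"a{4}l?b{2}", encoded)))  # does a helix-turn-strand-like motif occur?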

    Murasaki: A Fast, Parallelizable Algorithm to Find Anchors from Multiple Genomes

    BACKGROUND: With the number of available genome sequences increasing rapidly, the magnitude of sequence data required for multiple-genome analyses is a challenging problem. When large-scale rearrangements break the collinearity of gene orders among genomes, genome comparison algorithms must first identify sets of short well-conserved sequences present in each genome, termed anchors. Previously, anchor identification among multiple genomes has been achieved with tools ranging from pairwise alignment tools like BLASTZ to progressive alignment tools like TBA, but the computational requirements for sequence comparisons of multiple genomes quickly become a limiting factor as the number and scale of genomes grow. METHODOLOGY/PRINCIPAL FINDINGS: Our algorithm, named Murasaki, makes it possible to identify anchors within multiple large sequences on the scale of several hundred megabases in a few minutes using a single CPU. Two advanced features of Murasaki are (1) adaptive hash function generation, which enables efficient use of arbitrary mismatch patterns (spaced seeds) and therefore the comparison of multiple mammalian genomes in a practical amount of computation time, and (2) parallelizable execution that decreases the required wall-clock and CPU times. Murasaki can perform a sensitive anchoring of eight mammalian genomes (human, chimp, rhesus, orangutan, mouse, rat, dog, and cow) in 21 hours of CPU time (42 minutes of wall time). This is the first single-pass in-core anchoring of multiple mammalian genomes. We evaluated Murasaki by comparing it with the genome alignment programs BLASTZ and TBA. We show that Murasaki can anchor multiple genomes in near linear time, compared to the quadratic time requirements of BLASTZ and TBA, while improving overall accuracy. CONCLUSIONS/SIGNIFICANCE: Murasaki provides an open source platform that takes advantage of long patterns, cluster computing, and novel hash algorithms to produce accurate anchors across multiple genomes with computational efficiency significantly greater than existing methods. Murasaki is available under the GPL at http://murasaki.sourceforge.net
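
    The spaced-seed anchoring idea can be illustrated with a toy version: apply a binary mask to each window of every genome to produce a key, bucket positions by key, and report keys that pick up at least one position in every genome. The mask, the string "hash", and the output format below are simplified assumptions, not Murasaki's adaptive hash functions.

    # Toy spaced-seed anchoring: a '1' in the mask means the base contributes to the
    # key, '0' means it is ignored (a tolerated mismatch position).
    from collections import defaultdict

    MASK = "1101011"   # example spaced seed; Murasaki derives masks adaptively

    def seed_key(window, mask=MASK):
        return "".join(b for b, m in zip(window, mask) if m == "1")

    def anchor_candidates(genomes, mask=MASK):
        """Map seed key -> list of (genome index, position), kept only if every genome is hit."""
        buckets = defaultdict(list)
        w = len(mask)
        for g, seq in enumerate(genomes):
            for i in range(len(seq) - w + 1):
                buckets[seed_key(seq[i:i + w], mask)].append((g, i))
        return {k: v for k, v in buckets.items()
                if len({g for g, _ in v}) == len(genomes)}

    genomes = ["TTACGTACGAA", "GGACATGCGTT", "CCACTTACGGG"]
    for key, hits in anchor_candidates(genomes).items():
        print(key, hits)    # ACTCG [(0, 2), (1, 2), (2, 2)] despite mismatches at masked positions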